Lab 7: Time Series
Introduction
Data comes in many different shapes. Some datasets have their observations spread across columns, others across rows, and most have gaps. In this lab we are particularly interested in time series, which differ from many other data types and call for their own style of analysis. In this tutorial you will learn how to:
- preprocess the data (clean NAs, change the data structure etc.)
- play with date objects
- plot time series
- analyze trend and seasonality
Getting Started
We will visualize the Covid Confirmed Cases dataset by Johns Hopkins University and the Maddison Project GDP Per Capita dataset. To follow along you need:
- to download covid_confirmed.csv and maddison.csv data uploaded on LEARN
- some packages including
- tidyverse
- lubridate
- gganimate (takes effort to install)
- plotly (takes time to install)
## X Province Country Date Confirmed Deaths Recovered
## 1 1 Mainland China 1/22/2020 17:00 1 NA NA
## 2 2 Mainland China 1/22/2020 17:00 14 NA NA
## 3 3 Mainland China 1/22/2020 17:00 6 NA NA
## 4 4 Mainland China 1/22/2020 17:00 1 NA NA
## 5 5 Mainland China 1/22/2020 17:00 NA NA NA
## 6 6 Mainland China 1/22/2020 17:00 26 NA NA
## X Province Country Date Confirmed Deaths
## 277131 277131 Vietnam 2020-06-13 03:33:14 333 0
## 277132 277132 West Bank and Gaza 2020-06-13 03:33:14 489 3
## 277133 277133 Western Sahara 2020-06-13 03:33:14 9 1
## 277134 277134 Yemen 2020-06-13 03:33:14 632 139
## 277135 277135 Zambia 2020-06-13 03:33:14 1321 10
## 277136 277136 Zimbabwe 2020-06-13 03:33:14 343 4
## Recovered
## 277131 323
## 277132 414
## 277133 6
## 277134 28
## 277135 1104
## 277136 51
Preprocessing Data
Datetime in R
Dates and datetimes are special data types that are processed and used in their own fashion. Most data has them in some particular format, but that format is not universal. We can write the birthday of Guthrie Govan in many ways: 12/27/1971, 12/27/71, 27.12.1971, 27/12/1971 and so on. Luckily there is an ISO format for dates, 1971-12-27, and it is R’s default format.
If you read a csv file, dates come in as plain character strings (or factors in older R versions) by default. You can convert them as below:
## [1] "character"
## [1] "Date"
## [1] "1971-12-27"
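As a sketch of the conversion above (the variable name is my own):

```r
bday <- '1971-12-27'   # ISO-formatted string
class(bday)            # "character"
bday <- as.Date(bday)  # default format is %Y-%m-%d
class(bday)            # "Date"
bday                   # "1971-12-27"
```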
If the date is not in YYYY-MM-DD format, you need to specify the input format:
## [1] "Date"
## [1] "1971-12-27"
Notice that the format string uses a small language of its own:
| Code | Value | Example |
|---|---|---|
| %d | day | 27 |
| %m | month | 12 |
| %b | month (string) | Dec |
| %B | month (long str.) | December |
| %a | weekday (string) | Mon |
| %A | weekday (long str.) | Monday |
| %y | year (2 digits) | 71 |
| %Y | year (4 digits) | 1971 |
| %H | hour | 21 |
| %M | minute | 30 |
| %S | second | 0 |
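Combining several of these codes (a sketch; parsing English month names assumes an English locale):

```r
as.Date('27 December 1971', format='%d %B %Y')  # "1971-12-27" (English locale)
as.Date('Dec 27, 71', format='%b %d, %y')       # "1971-12-27"; %y maps 71 to 1971
```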
So if your dates come in other formats, you can convert them easily:
## [1] "1971-12-27"
If you receive NA, it means the date couldn’t be parsed:
## [1] NA
After a good conversion, you can easily carry out mathematical operations:
library('lubridate')
guthrie <- as.Date('12/27/1971', format='%m/%d/%Y')
marco <- as.Date('12/24/1970', format='%m/%d/%Y')
(today() - guthrie)/365
## Time difference of 48.52329 days
(today() - marco)/365
## Time difference of 49.53151 days
guthrie - marco
## Time difference of 368 days
If you have datetime data then you can convert it using strptime:
## [1] "1971-12-27 21:30:00 GMT"
## [1] "1971-12-27 21:30:00 GMT"
You can also give any format you want and print:
## [1] "Monday, December 27, 1971 21:30:00"
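A sketch of the two steps above (strptime to parse, format to print; the variable name is mine):

```r
bt <- strptime('27.12.1971 21:30:00', format='%d.%m.%Y %H:%M:%S', tz='GMT')
bt                                    # "1971-12-27 21:30:00 GMT"
format(bt, '%A, %B %d, %Y %H:%M:%S')  # "Monday, December 27, 1971 21:30:00" (English locale)
```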
Lubridate Package
It is not easy to extract useful components (day, week, weekday, …) with the above methods. What you need is lubridate:
library('lubridate')
bday <- as.Date('12/27/1971', format='%m/%d/%Y')
c(day(bday), month(bday), year(bday))
## [1] 27 12 1971
## [1] 1971 4
## [1] "52" "2" "Monday"
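The other two outputs above were likely produced by calls such as these (a guess on my part, though the extractor functions themselves are real lubridate functions):

```r
library(lubridate)
bday <- as.Date('1971-12-27')
c(year(bday), quarter(bday))               # 1971 4
c(week(bday), wday(bday), weekdays(bday))  # "52" "2" "Monday" (coerced to character)
```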
It also has a separate function for datetimes, as_datetime, which does the same thing as strptime:
## [1] "1971-12-27 21:30:00 GMT"
but it also has some handy parser functions that require less effort and are very generic:
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27 21:30:00 UTC"
## [1] "1971-12-27"
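These generic parsers detect the separators for you; a sketch:

```r
library(lubridate)
ymd_hms('1971-12-27 21:30:00')  # "1971-12-27 21:30:00 UTC"
dmy_hms('27.12.1971 21:30:00')  # same instant, parsed from day-first order
mdy_hms('12/27/1971 21:30:00')  # same, month-first
ymd('19711227')                 # date only: "1971-12-27"
```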
Earthquake Data
Now we know enough to process the dates in this data:
## Date Time Latitude Longitude Type Depth Depth.Error
## 1 01/02/1965 13:44:18 19.246 145.616 Earthquake 131.6 NA
## 2 01/04/1965 11:29:49 1.863 127.352 Earthquake 80.0 NA
## 3 01/05/1965 18:05:58 -20.579 -173.972 Earthquake 20.0 NA
## 4 01/08/1965 18:49:43 -59.076 -23.557 Earthquake 15.0 NA
## 5 01/09/1965 13:32:50 11.938 126.427 Earthquake 15.0 NA
## 6 01/10/1965 13:36:32 -13.405 166.629 Earthquake 35.0 NA
## Depth.Seismic.Stations Magnitude Magnitude.Type Magnitude.Error
## 1 NA 6.0 MW NA
## 2 NA 5.8 MW NA
## 3 NA 6.2 MW NA
## 4 NA 5.8 MW NA
## 5 NA 5.8 MW NA
## 6 NA 6.7 MW NA
## Magnitude.Seismic.Stations Azimuthal.Gap Horizontal.Distance Horizontal.Error
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## Root.Mean.Square ID Source Location.Source Magnitude.Source
## 1 NA ISCGEM860706 ISCGEM ISCGEM ISCGEM
## 2 NA ISCGEM860737 ISCGEM ISCGEM ISCGEM
## 3 NA ISCGEM860762 ISCGEM ISCGEM ISCGEM
## 4 NA ISCGEM860856 ISCGEM ISCGEM ISCGEM
## 5 NA ISCGEM860890 ISCGEM ISCGEM ISCGEM
## 6 NA ISCGEM860922 ISCGEM ISCGEM ISCGEM
## Status
## 1 Automatic
## 2 Automatic
## 3 Automatic
## 4 Automatic
## 5 Automatic
## 6 Automatic
## [1] "01/02/1965 13:44:18" "01/04/1965 11:29:49" "01/05/1965 18:05:58"
## [4] "01/08/1965 18:49:43" "01/09/1965 13:32:50" "01/10/1965 13:36:32"
## [1] "character"
## Warning: 3 failed to parse.
## Warning: 3 failed to parse.
## [1] "1965-01-02 13:44:18 UTC" "1965-01-04 11:29:49 UTC"
## [3] "1965-01-05 18:05:58 UTC" "1965-01-08 18:49:43 UTC"
## [5] "1965-01-09 13:32:50 UTC" "1965-01-10 13:36:32 UTC"
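A sketch of that parse on the first two rows, with the Date and Time strings pasted together (the full data would come from the csv; the malformed entries are the ones that “failed to parse” and become NA):

```r
library(lubridate)
d <- c('01/02/1965 13:44:18', '01/04/1965 11:29:49')
mdy_hms(d)  # "1965-01-02 13:44:18 UTC" "1965-01-04 11:29:49 UTC"
```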
Basic operations
In R, there are some operations that you must know by heart: selecting rows and columns, slicing the data, and so on. “By heart” here means you must know how they work.
Assume you have the following data:
musicians <- data.frame(Name=c('Guthrie Govan', 'Marco Minnemann', 'Tony Levin', 'Steven Wilson', 'Jordan Rudess'),
Instrument=c('guitar','drums','bass','guitar/keyboard', 'keyboard'),
Birthday = c('27.12.1971', '24.12.1970','6.6.1946','3.11.1967', '4.11.1956'))
musicians
## Name Instrument Birthday
## 1 Guthrie Govan guitar 27.12.1971
## 2 Marco Minnemann drums 24.12.1970
## 3 Tony Levin bass 6.6.1946
## 4 Steven Wilson guitar/keyboard 3.11.1967
## 5 Jordan Rudess keyboard 4.11.1956
We can convert the Birthdays as to date and calculate their age as we have shown above:
musicians$Birthday <- dmy(musicians$Birthday)
musicians$Age <- as.numeric(today() - musicians$Birthday)/365
musicians
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
## 3 Tony Levin bass 1946-06-06 74.09863
## 4 Steven Wilson guitar/keyboard 1967-11-03 52.67397
## 5 Jordan Rudess keyboard 1956-11-04 63.67671
Selecting Columns
Since this is a data frame, you can use musicians$Birthday to choose the third column, but for matrices that won’t work. There is another notation you need to know: you can select rows and columns using the [ , ] notation:
## [1] Guthrie Govan Marco Minnemann Tony Levin Steven Wilson
## [5] Jordan Rudess
## 5 Levels: Guthrie Govan Jordan Rudess Marco Minnemann ... Tony Levin
## Name Instrument Birthday
## 1 Guthrie Govan guitar 1971-12-27
## 2 Marco Minnemann drums 1970-12-24
## 3 Tony Levin bass 1946-06-06
## 4 Steven Wilson guitar/keyboard 1967-11-03
## 5 Jordan Rudess keyboard 1956-11-04
## Name Age
## 1 Guthrie Govan 48.52329
## 2 Marco Minnemann 49.53151
## 3 Tony Levin 74.09863
## 4 Steven Wilson 52.67397
## 5 Jordan Rudess 63.67671
or we can drop columns 2 and 3 by using the ‘-’ sign:
## Name Age
## 1 Guthrie Govan 48.52329
## 2 Marco Minnemann 49.53151
## 3 Tony Levin 74.09863
## 4 Steven Wilson 52.67397
## 5 Jordan Rudess 63.67671
## Name Age
## 1 Guthrie Govan 48.52329
## 2 Marco Minnemann 49.53151
## 3 Tony Levin 74.09863
## 4 Steven Wilson 52.67397
## 5 Jordan Rudess 63.67671
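A compact sketch of the bracket notation on a toy data frame (not the full one above):

```r
m <- data.frame(Name = c('Guthrie Govan', 'Tony Levin'),
                Instrument = c('guitar', 'bass'),
                Age = c(48.5, 74.1))
m[, 1]                 # first column as a vector
m[, c('Name', 'Age')]  # columns by name
m[1, ]                 # first row
m[, -2]                # everything except the second column
```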
Tidyverse way
There is a special function defined in tidyverse, select for selecting columns:
## Name Birthday
## 1 Guthrie Govan 1971-12-27
## 2 Marco Minnemann 1970-12-24
## 3 Tony Levin 1946-06-06
## 4 Steven Wilson 1967-11-03
## 5 Jordan Rudess 1956-11-04
It also allows choosing a range of columns by name:
## Name Instrument Birthday
## 1 Guthrie Govan guitar 1971-12-27
## 2 Marco Minnemann drums 1970-12-24
## 3 Tony Levin bass 1946-06-06
## 4 Steven Wilson guitar/keyboard 1967-11-03
## 5 Jordan Rudess keyboard 1956-11-04
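A sketch of select on toy data (select is from dplyr, loaded with the tidyverse):

```r
library(dplyr)
m <- data.frame(Name = c('Guthrie Govan', 'Tony Levin'),
                Instrument = c('guitar', 'bass'),
                Age = c(48.5, 74.1))
select(m, Name, Age)        # individual columns by name
select(m, Name:Instrument)  # a range of columns
```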
Selecting Rows
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
## 4 Steven Wilson guitar/keyboard 1967-11-03 52.67397
or we can drop the rows 3 and 4 by using ‘-’ sign:
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
## 5 Jordan Rudess keyboard 1956-11-04 63.67671
We can also use logical arrays as filters:
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
## 4 Steven Wilson guitar/keyboard 1967-11-03 52.67397
The musicians younger than age 50:
## [1] TRUE TRUE FALSE FALSE FALSE
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
Guitarists:
## [1] TRUE FALSE FALSE FALSE FALSE
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
That doesn’t look right: Steven Wilson also plays guitar, but ‘guitar/keyboard’ is not an exact match. Another function, grepl, is very handy for text filtering. It returns a logical vector telling whether each element of a vector contains a particular phrase. For example, the musicians playing guitar can be retrieved as:
## [1] TRUE FALSE FALSE TRUE FALSE
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 4 Steven Wilson guitar/keyboard 1967-11-03 52.67397
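The contrast between exact matching and grepl, in isolation:

```r
instruments <- c('guitar', 'drums', 'bass', 'guitar/keyboard', 'keyboard')
instruments == 'guitar'       # TRUE FALSE FALSE FALSE FALSE (exact match only)
grepl('guitar', instruments)  # TRUE FALSE FALSE TRUE FALSE (contains the phrase)
```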
Tidyverse way
The tidyverse has a special function for this, filter, which does the same thing as subset. So you may ask why. Don’t ask me.
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Marco Minnemann drums 1970-12-24 49.53151
and the guitarist extraordinaire:
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
and the guitarists:
## Name Instrument Birthday Age
## 1 Guthrie Govan guitar 1971-12-27 48.52329
## 2 Steven Wilson guitar/keyboard 1967-11-03 52.67397
Row/Column Bind
Sometimes we want to attach a row or a column to an existing dataset; for example, we may want to add a row for a new musician. rbind and cbind attach new rows and columns respectively:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
## [5,] 101 105 109 113 117
## [6,] 102 106 110 114 118
## [7,] 103 107 111 115 119
## [8,] 104 108 112 116 120
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 5 9 13 17 101 105 109 113 117
## [2,] 2 6 10 14 18 102 106 110 114 118
## [3,] 3 7 11 15 19 103 107 111 115 119
## [4,] 4 8 12 16 20 104 108 112 116 120
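The matrices above were presumably built along these lines (a sketch):

```r
A <- matrix(1:20, nrow = 4)     # 4 x 5, filled column by column
B <- matrix(101:120, nrow = 4)
rbind(A, B)  # stacks rows: an 8 x 5 matrix
cbind(A, B)  # glues columns: a 4 x 10 matrix
```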
Summarizing
In R, there are some predefined functions that are fast and work well. You have to know them if you want to learn R. Here they are:
| function | description | example |
|---|---|---|
| apply() | applies a function to rows(1) or columns(2) | apply(data,2,mean) |
| lapply() | applies a function to using a list of elements | lapply(list_of_data,mean) |
| sapply() | same as lapply, but simplifies to matrix | sapply(list_of_data,mean) |
| summarize() | applies functions only to columns | summarize(data, average=mean(Age)) |
## Year Canada France Germany Italy Japan United.Kingdom United.States
## 1 1950 7291 5186 3881 3172 1921 6939 9561
## 2 1951 7533 5461 4206 3451 2126 7123 10116
## 3 1952 7833 5564 4553 3591 2336 7091 10316
## 4 1953 7984 5684 4905 3830 2474 7346 10613
## 5 1954 7699 5915 5247 3947 2582 7619 10359
## 6 1955 8201 6199 5797 4190 2771 7868 10897
apply
For the average of each country over the whole year range, we need to apply mean to all columns (2):
## Year Canada France Germany Italy
## 1980.00 15685.56 14033.89 13221.28 12147.34
## Japan United.Kingdom United.States
## 12891.92 14202.72 19479.46
The Year column is not an economic indicator, so we can drop it before applying:
## Canada France Germany Italy Japan
## 15685.56 14033.89 13221.28 12147.34 12891.92
## United.Kingdom United.States
## 14202.72 19479.46
For the average of each year over the whole countries, we need to apply mean function to all rows (1):
## [1] 5421.571 5716.571 5897.714 6119.429 6195.429 6560.429 6776.571
## [8] 6932.143 6976.429 7283.857 7610.143 7863.000 8188.000 8481.000
## [15] 8911.000 9270.429 9681.286 9969.286 10464.000 10960.286 11311.000
## [22] 11608.286 12083.429 12741.429 12827.143 12772.571 13302.714 13680.714
## [29] 14147.143 14635.429 14702.714 14833.714 14767.571 15070.000 15607.286
## [36] 16067.857 16461.286 16929.571 17603.143 18054.143 18168.714 18219.143
## [43] 18344.286 18368.000 18811.143 19158.714 19450.286 19977.571 20364.857
## [50] 20866.286 21544.857 21763.857 21945.286 22192.286 22635.571 22980.286
## [57] 23437.286 23851.143 23658.286 22575.000 23114.571
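Both margins of apply in one place, on a tiny econ-like table (the rows are made up):

```r
econ_toy <- data.frame(Year = 1950:1952,
                       Canada = c(7291, 7533, 7833),
                       France = c(5186, 5461, 5564))
apply(econ_toy[, -1], 2, mean)  # column means (Year dropped first)
apply(econ_toy[, -1], 1, mean)  # row means, one per year
```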
Sometimes we need a more specialized function. For example, to calculate the last 5 years’ average for each country, we could write:
## Canada France Germany Italy Japan
## 24966.2 21764.4 20368.0 19146.2 21884.2
## United.Kingdom United.States
## 24231.0 30930.8
Or, more compactly, in one line:
## Canada France Germany Italy Japan
## 24966.2 21764.4 20368.0 19146.2 21884.2
## United.Kingdom United.States
## 24231.0 30930.8
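The “last 5 values” average can be written as a named or an anonymous function (a sketch):

```r
x <- c(10, 20, 30, 40, 50, 60, 70)
last5 <- function(v) mean(tail(v, 5))
last5(x)                           # 50
# anonymous, as you would pass it straight into apply:
(function(v) mean(tail(v, 5)))(x)  # 50
```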
lapply/sapply
lapply is the apply function for lists. The input can be a vector or a list, and the output is always a list:
## [[1]]
## [1] 15685.56
##
## [[2]]
## [1] 14033.89
##
## [[3]]
## [1] 13221.28
##
## [[4]]
## [1] 12147.34
##
## [[5]]
## [1] 12891.92
##
## [[6]]
## [1] 14202.72
##
## [[7]]
## [1] 19479.46
sapply works like lapply, but tries to simplify the output into a vector or matrix; if it cannot, it returns the same list output as lapply:
## [1] 15685.56 14033.89 13221.28 12147.34 12891.92 14202.72 19479.46
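The contrast in one place:

```r
lst <- list(a = 1:5, b = 6:10)
lapply(lst, mean)  # always returns a list
sapply(lst, mean)  # simplified to a named vector: a = 3, b = 8
```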
summarise
The summarise function comes with the tidyverse package. It works similarly to apply, but can be more flexible in some cases:
## Average SDeviance AvgLast5
## 1 15685.56 5621.451 24954
It is not very handy when you have many columns, but the summarise_all variant works across all of them:
## Canada France Germany Italy Japan United.Kingdom United.States
## 1 15685.56 14033.89 13221.28 12147.34 12891.92 14202.72 19479.46
summarise_all(econ[ ,-1], list(Average = mean, SDeviance = sd, Last5 = function(x) mean(tail(x,5))))
## Canada_Average France_Average Germany_Average Italy_Average Japan_Average
## 1 15685.56 14033.89 13221.28 12147.34 12891.92
## United.Kingdom_Average United.States_Average Canada_SDeviance
## 1 14202.72 19479.46 5621.451
## France_SDeviance Germany_SDeviance Italy_SDeviance Japan_SDeviance
## 1 5350.765 4977.198 5451.625 6955.059
## United.Kingdom_SDeviance United.States_SDeviance Canada_Last5 France_Last5
## 1 5456.816 6965.634 24966.2 21764.4
## Germany_Last5 Italy_Last5 Japan_Last5 United.Kingdom_Last5
## 1 20368 19146.2 21884.2 24231
## United.States_Last5
## 1 30930.8
It also has a summarise_at variant, which summarizes a given list or range of columns:
## Canada Germany Japan
## 1 15685.56 13221.28 12891.92
## Germany Italy Japan
## 1 13221.28 12147.34 12891.92
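A sketch of summarise_at on toy data (summarise_at is from dplyr):

```r
library(dplyr)
econ_toy <- data.frame(Year = 1950:1952,
                       Canada = c(7291, 7533, 7833),
                       France = c(5186, 5461, 5564))
summarise_at(econ_toy, vars(Canada, France), mean)  # a list of columns
summarise_at(econ_toy, vars(Canada:France), mean)   # or a range
```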
Creating New Variables
We have already seen one way to create new variables: simply assign them. But there are other, handier ways.
Let’s, for example, create a new column, decade, which may be handy for grouping:
econCopy <- econ # not to lose the original data
econCopy$Decade <- floor(econCopy$Year/10)*10
head(econCopy)
## Year Canada France Germany Italy Japan United.Kingdom United.States Decade
## 1 1950 7291 5186 3881 3172 1921 6939 9561 1950
## 2 1951 7533 5461 4206 3451 2126 7123 10116 1950
## 3 1952 7833 5564 4553 3591 2336 7091 10316 1950
## 4 1953 7984 5684 4905 3830 2474 7346 10613 1950
## 5 1954 7699 5915 5247 3947 2582 7619 10359 1950
## 6 1955 8201 6199 5797 4190 2771 7868 10897 1950
or you can use mutate from the tidyverse package:
## Year Canada France Germany Italy Japan United.Kingdom United.States Decade
## 1 1950 7291 5186 3881 3172 1921 6939 9561 1950
## 2 1951 7533 5461 4206 3451 2126 7123 10116 1950
## 3 1952 7833 5564 4553 3591 2336 7091 10316 1950
## 4 1953 7984 5684 4905 3830 2474 7346 10613 1950
## 5 1954 7699 5915 5247 3947 2582 7619 10359 1950
## 6 1955 8201 6199 5797 4190 2771 7868 10897 1950
Grouping
In previous weeks, we have seen how to group the data by some variables. It is worth mentioning here too:
temp <- mutate(econ, Decade = floor(Year/10)*10)
group_by(temp, Decade) %>% summarise(CanadaAvg=mean(Canada), USAvg=mean(United.States))
## # A tibble: 7 x 3
## Decade CanadaAvg USAvg
## <dbl> <dbl> <dbl>
## 1 1950 8101 10556.
## 2 1960 10232. 13158.
## 3 1970 14202. 16745.
## 4 1980 17323. 20410
## 5 1990 19324. 24801.
## 6 2000 24006. 30107.
## 7 2010 24941 30491
or we can do it for all countries:
temp <- mutate(econ, Decade = floor(Year/10)*10)
group_by(temp, Decade) %>% summarise_at(vars(Canada:United.States), mean)
## # A tibble: 7 x 8
## Decade Canada France Germany Italy Japan United.Kingdom United.States
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1950 8101 6105. 5517. 4109. 2714. 7614. 10556.
## 2 1960 10232. 9013. 8936. 7024. 6044. 9573. 13158.
## 3 1970 14202. 13067. 12267. 10785. 11389. 11923. 16745.
## 4 1980 17323. 15698. 15044. 14062 15273. 14258. 20410
## 5 1990 19324. 18322. 17198 17103. 19759. 17703. 24801.
## 6 2000 24006. 21254. 19621. 19164. 21269. 23188. 30107.
## 7 2010 24941 21477 20661 18520 21935 23777 30491
Gather and Spread
There is a life-saver function, gather, and its dual, spread, in the tidyverse package. May the tidyverse be with you.
Let’s see the problem first. Assume you want to plot the time series of the economies and compare them. You must add a separate line for each country, because each country is its own column:
ggplot(econ, aes(x=Year)) +
geom_line(aes(y=Canada), colour='firebrick') +
geom_line(aes(y=France), colour = 'steelblue') +
geom_line(aes(y=Italy), colour = 'darkolivegreen3') +
geom_text(aes(y=tail(Canada,1), x=2013,label='Canada')) +
geom_text(aes(y=tail(France,1), x=2013,label='France')) +
geom_text(aes(y=tail(Italy,1), x=2013, label='Italy')) +
labs(y='GDPPC')

I don’t have time to do it for all countries.
Instead, you can transform your data using gather as below:
## Year Country GDPPC
## 1 1950 Canada 7291
## 2 1951 Canada 7533
## 3 1952 Canada 7833
## 4 1953 Canada 7984
## 5 1954 Canada 7699
## 6 1955 Canada 8201
## Year Country GDPPC
## 422 2005 United.States 30842
## 423 2006 United.States 31358
## 424 2007 United.States 31655
## 425 2008 United.States 31251
## 426 2009 United.States 29899
## 427 2010 United.States 30491
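The long table above was presumably produced by gather along these lines (a sketch on toy data; gather is from tidyr):

```r
library(tidyr)
econ_toy <- data.frame(Year = 1950:1951,
                       Canada = c(7291, 7533),
                       France = c(5186, 5461))
econ_toy_long <- gather(econ_toy, key = 'Country', value = 'GDPPC', Canada:France)
econ_toy_long  # 4 rows: one per Year-Country pair
```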
Now we have handier data: each row contains the GDP per capita for one year-country combination. It is way easier to plot:
col <- c('firebrick','steelblue','darkolivegreen','purple','black','brown','orange')
ggplot(econ.long, aes(x=Year, y=GDPPC)) +
geom_line(aes(colour=Country)) +
geom_text(data=subset(econ.long, Year==2010), aes(y=GDPPC, x=2010, label=Country), hjust=-.1) +
xlim(c(1950,2017)) +
theme(legend.position = 'none') +
scale_color_manual(values = col)

In case you need the data in a wider shape, you can use spread:
## Year Canada France Germany Italy Japan United.Kingdom United.States
## 1 1950 7291 5186 3881 3172 1921 6939 9561
## 2 1951 7533 5461 4206 3451 2126 7123 10116
## 3 1952 7833 5564 4553 3591 2336 7091 10316
## 4 1953 7984 5684 4905 3830 2474 7346 10613
## 5 1954 7699 5915 5247 3947 2582 7619 10359
## 6 1955 8201 6199 5797 4190 2771 7868 10897
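spread is the inverse operation (a sketch on toy data):

```r
library(tidyr)
econ_toy_long <- data.frame(Year = c(1950, 1950, 1951, 1951),
                            Country = c('Canada', 'France', 'Canada', 'France'),
                            GDPPC = c(7291, 5186, 7533, 5461))
spread(econ_toy_long, key = Country, value = GDPPC)  # back to one column per country
```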
Using pipes
Pipes are a way to write elegant, more readable code. Without them, you run a function, assign the output to a new object, then call the next function on that object, and so on.
With pipes, we can feed the output of one function directly into the next, and assign only the final result if we want:
econ.decade <- mutate(econ, Decade = floor(Year/10)*10) %>%
group_by(Decade) %>%
summarise_at(vars(Canada:United.States), mean) %>%
gather(key='Country', value='GDPPC', Canada:United.States)
head(econ.decade)
## # A tibble: 6 x 3
## Decade Country GDPPC
## <dbl> <chr> <dbl>
## 1 1950 Canada 8101
## 2 1960 Canada 10232.
## 3 1970 Canada 14202.
## 4 1980 Canada 17323.
## 5 1990 Canada 19324.
## 6 2000 Canada 24006.
or you can plot it directly:
mutate(econ, Decade = floor(Year/10)*10) %>%
group_by(Decade) %>%
summarise_at(vars(Canada:United.States), mean) %>%
gather(key='Country', value='GDPPC', Canada:United.States) %>%
ggplot(aes(x=Decade, y=GDPPC)) +
geom_line(aes(colour=Country)) +
scale_color_manual(values = col) +
ggtitle('GDP Per Capita Decade Averages')

Preprocessing Covid Confirmed Data
Sometimes life doesn’t give us what we want; we would rather be driving a yacht on Lake Ontario, barbecuing with good drinks. In such cases you have to work harder.
For example, in the Covid Confirmed Cases dataset by Johns Hopkins University, the dates are painfully and arbitrarily entered. Some entries are in the form 2020-03-17T18:33:03, some 2020-03-17 18:33:03 and some 03/17/2020:
## X Province Country Date Confirmed Deaths Recovered
## 1 1 Mainland China 1/22/2020 17:00 1 NA NA
## 2 2 Mainland China 1/22/2020 17:00 14 NA NA
## 3 3 Mainland China 1/22/2020 17:00 6 NA NA
## 4 4 Mainland China 1/22/2020 17:00 1 NA NA
## 5 5 Mainland China 1/22/2020 17:00 NA NA NA
## 6 6 Mainland China 1/22/2020 17:00 26 NA NA
## X Province Country Date Confirmed Deaths
## 277131 277131 Vietnam 2020-06-13 03:33:14 333 0
## 277132 277132 West Bank and Gaza 2020-06-13 03:33:14 489 3
## 277133 277133 Western Sahara 2020-06-13 03:33:14 9 1
## 277134 277134 Yemen 2020-06-13 03:33:14 632 139
## 277135 277135 Zambia 2020-06-13 03:33:14 1321 10
## 277136 277136 Zimbabwe 2020-06-13 03:33:14 343 4
## Recovered
## 277131 323
## 277132 414
## 277133 6
## 277134 28
## 277135 1104
## 277136 51
We have to write a few lines of code before we can work with this mess:
mdy_formatted <- grepl('/', covid$Date)   ## mdy-formatted lines
ymd_formatted <- !grepl('/', covid$Date)  ## ymd-formatted lines
covidP1 <- subset(covid, mdy_formatted)
covidP2 <- subset(covid, ymd_formatted)
head(covidP1)
## X Province Country Date Confirmed Deaths Recovered
## 1 1 Mainland China 1/22/2020 17:00 1 NA NA
## 2 2 Mainland China 1/22/2020 17:00 14 NA NA
## 3 3 Mainland China 1/22/2020 17:00 6 NA NA
## 4 4 Mainland China 1/22/2020 17:00 1 NA NA
## 5 5 Mainland China 1/22/2020 17:00 NA NA NA
## 6 6 Mainland China 1/22/2020 17:00 26 NA NA
## X Province Country Date Confirmed Deaths Recovered
## 561 561 Mainland China 2020-02-02T23:43:02 11177 350 295
## 562 562 Mainland China 2020-02-02T18:03:05 661 0 32
## 563 563 Mainland China 2020-02-02T18:03:05 632 0 15
## 564 564 Mainland China 2020-02-02T18:03:05 493 2 10
## 565 565 Mainland China 2020-02-02T18:03:05 463 0 16
## 566 566 Mainland China 2020-02-02T18:03:05 340 0 7
covidP1$Date <- mdy_hm(covidP1$Date)
covidP2$Date <- ymd_hms(covidP2$Date)
covid <- rbind(covidP1,covidP2)
covid <- covid[order(covid$Date),]
head(covid)
## X Province Country Date Confirmed Deaths Recovered
## 1 1 Mainland China 2020-01-22 17:00:00 1 NA NA
## 2 2 Mainland China 2020-01-22 17:00:00 14 NA NA
## 3 3 Mainland China 2020-01-22 17:00:00 6 NA NA
## 4 4 Mainland China 2020-01-22 17:00:00 1 NA NA
## 5 5 Mainland China 2020-01-22 17:00:00 NA NA NA
## 6 6 Mainland China 2020-01-22 17:00:00 26 NA NA
## X Province Country Date Confirmed Deaths
## 277131 277131 Vietnam 2020-06-13 03:33:14 333 0
## 277132 277132 West Bank and Gaza 2020-06-13 03:33:14 489 3
## 277133 277133 Western Sahara 2020-06-13 03:33:14 9 1
## 277134 277134 Yemen 2020-06-13 03:33:14 632 139
## 277135 277135 Zambia 2020-06-13 03:33:14 1321 10
## 277136 277136 Zimbabwe 2020-06-13 03:33:14 343 4
## Recovered
## 277131 323
## 277132 414
## 277133 6
## 277134 28
## 277135 1104
## 277136 51
The Confirmed column contains cumulative confirmed cases. For example, the subset for Turkey shows steadily increasing numbers:
## X Province Country Date Confirmed Deaths Recovered
## 258600 258600 Turkey 2020-06-08 03:33:22 170132 4692 137969
## 262272 262272 Turkey 2020-06-09 03:33:03 171121 4711 141380
## 265946 265946 Turkey 2020-06-10 04:07:00 172114 4729 144598
## 269663 269663 Turkey 2020-06-11 03:33:41 173036 4746 146839
## 273387 273387 Turkey 2020-06-12 05:09:52 174023 4763 147860
## 277125 277125 Turkey 2020-06-13 03:33:14 175218 4778 149102
However, countries with Province information report province-wise numbers, which are also cumulative:
## X Province Country Date Confirmed Deaths Recovered
## 258273 258273 Ontario Canada 2020-06-08 03:33:22 32096 2502 24694
## 261946 261946 Ontario Canada 2020-06-09 03:33:03 32395 2524 25002
## 265620 265620 Ontario Canada 2020-06-10 04:07:00 32678 2536 25375
## 269326 269326 Ontario Canada 2020-06-11 03:33:41 32936 2552 25956
## 273047 273047 Ontario Canada 2020-06-12 05:09:52 33173 2563 26358
## 276785 276785 Ontario Canada 2020-06-13 03:33:14 33378 2573 26698
We are more interested in country-wise numbers, so we will calculate daily sums of the Confirmed, Deaths and Recovered columns for each country. For this, we need to:
- use dates, not datetimes (cut the time part)
- drop the Province column
- group by Country and Date
- summarize by calculating the sums
covid.C <- covid %>%
mutate(Date = as.Date(Date)) %>%
select(Country:Recovered) %>%
group_by(Date, Country) %>%
summarize_at(vars(Confirmed:Recovered), sum)
head(covid.C)
## # A tibble: 6 x 5
## # Groups: Date [1]
## Date Country Confirmed Deaths Recovered
## <date> <fct> <int> <int> <int>
## 1 2020-01-22 Hong Kong NA NA NA
## 2 2020-01-22 Japan 2 NA NA
## 3 2020-01-22 Macau 1 NA NA
## 4 2020-01-22 Mainland China NA NA NA
## 5 2020-01-22 South Korea 1 NA NA
## 6 2020-01-22 Taiwan 1 NA NA
You can see that some countries show NA, because they have NA entries for some regions and numbers for others. In R, a sum involving NA is NA (e.g., NA + 5 = NA). We can use the na.rm = TRUE option:
covid.C <- covid %>%
mutate(Date = as.Date(Date)) %>%
select(Country:Recovered) %>%
group_by(Date, Country) %>%
summarize_at(vars(Confirmed:Recovered), sum, na.rm=T)
head(covid.C)
## # A tibble: 6 x 5
## # Groups: Date [1]
## Date Country Confirmed Deaths Recovered
## <date> <fct> <int> <int> <int>
## 1 2020-01-22 Hong Kong 0 0 0
## 2 2020-01-22 Japan 2 0 0
## 3 2020-01-22 Macau 1 0 0
## 4 2020-01-22 Mainland China 547 17 28
## 5 2020-01-22 South Korea 1 0 0
## 6 2020-01-22 Taiwan 1 0 0
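The NA propagation itself is easy to verify in base R:

```r
sum(c(5, NA))                # NA: any NA poisons the sum
sum(c(5, NA), na.rm = TRUE)  # 5: NAs removed before summing
```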
ggplot(covid.C, aes(x=Date, y=Confirmed, colour = Country)) +
geom_line() +
theme(legend.position = 'none')

label.pos <- covid.C %>% subset(Date == max(Date)) %>% arrange(desc(Confirmed)) %>% head(5)
ggplot(covid.C, aes(x=Date, y=Confirmed)) +
geom_line(aes(colour = Country)) +
geom_text(data=label.pos, aes(x=Date,y=Confirmed, label=Country)) +
theme(legend.position = 'none')

What about the G7 countries?
G6 <- c('Canada','France','Germany','Italy','Japan','United Kingdom') # Dropped US because it is an outlier.
label.pos <- subset(covid.C, Country %in% G6 & Date == tail(Date,1))
subset(covid.C, Country %in% G6) %>%
ggplot(aes(x=Date, y=Confirmed)) +
geom_line(aes(colour = Country)) +
geom_text(data=label.pos, aes(x=Date,y=Confirmed, label=Country)) +
theme(legend.position = 'none')

Kool Stuff
library('gganimate')
G6 <- c('Canada','France','Germany','Italy','Japan','United Kingdom') # Dropped US because it is an outlier.
label.pos <- subset(covid.C, Country %in% G6 & Date == tail(Date,1))
ggplot(subset(covid.C, Country %in% G6), aes(x=Date, y=Confirmed,colour=Country)) +
geom_line() +
geom_point(size = 2) +
geom_text(aes(x = as.Date('2020-06-01'), label = Country), hjust = 0) +
transition_reveal(Date) +
labs(title = 'Confirmed Cases in Six Countries', y = '# Confirmed') +
theme(plot.margin = margin(5.5, 40, 5.5, 5.5))

Is it slowing down?
A YouTube video (watched once, but I cannot find it now) suggests another approach to judging whether the spread is slowing. The basic idea is as follows. Population growth, whether human or bacterial or whatever, can be roughly modelled by the differential equation:
\[\frac{\mathrm{d}P}{\mathrm{d}t} = kP\]
or, rearranging, \[\frac{\mathrm{d}P}{P} = k\,\mathrm{d}t\]
The left side is the growth benchmarked to the current population, and the right side is a constant times the time increment. Integrating both sides, we arrive at \(\ln P = kt\) (plus a constant we can ignore here).
Enough calculus. We will plot \(\ln P\) on the x axis and \(\ln \mathrm{d}P\) on the y axis. For pure exponential growth \(\mathrm{d}P = kP\), so \(\ln \mathrm{d}P = \ln k + \ln P\) and the points fall on a straight line. If the shape is pretty much linear, the spread is not slowing; points dropping below that line indicate slowing:
covid.US <- subset(covid.C, Country=='US')
covid.US$NewCases <- c(0, diff(covid.US$Confirmed))
ggplot(covid.US, aes(x=log(Confirmed), y=log(NewCases))) +
geom_point()

covid.CAD <- subset(covid.C, Country=='Canada')
covid.CAD$NewCases <- c(0, diff(covid.CAD$Confirmed))
ggplot(covid.CAD, aes(x=log(Confirmed), y=log(NewCases))) +
geom_point()

It looks like we are getting better!
Time Series Decomposition
Time series data usually consist of several components: a trend, seasonality and noise.
We will show the time series using AirPassengers data loaded in R. This is a ts object which has a very different structure that is not suitable for ggplot:
## [1] "ts"
As can be seen in the figure above, there is a sawtooth pattern that widens over time, as well as an upward trend. The question is whether these movements can be explained by trend and seasonality alone, or whether some regions exceed them.
To answer these questions we will use decompose and stl functions in R. Let’s start with the first:
The decompose function splits the data into these components. If we sum the components at each time point, we recover the observed series.
We can see that there is an upward trend, and the seasonality has a similar pattern to the sawtooth we saw in the original data. However, there is an obvious pattern left in the remainder (random) component, so decompose could not fully explain the patterns in the original data.
If you focus on the seasonal part, you can see that the pattern repeats exactly from year to year. This is because it is calculated by averaging each calendar month across years (average of all Januaries, average of all Februaries, …). The decompose function assumes the seasonal component is this fixed monthly average.
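The decomposition shown above comes from a call like this (AirPassengers ships with base R):

```r
dec <- decompose(AirPassengers)  # additive model by default
plot(dec)
# observed = trend + seasonal + random (where the trend is defined)
```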
stl is more suitable for this analysis. It applies local averages or smoothers to extract the seasonality. The s.window parameter controls how many values are averaged/smoothed: it selects a window (range) of values, averages them, and slides the window from left to right. If we also set s.degree=1, it fits a local line instead of taking a plain average. Similarly, t.degree=1 fits a smoother over a larger window to extract the trend:
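A sketch of stl with the parameters discussed (the exact values here are my own choices):

```r
fit <- stl(AirPassengers, s.window = 7)                 # moving-average-style seasonal
fit2 <- stl(AirPassengers, s.window = 7, s.degree = 1)  # locally linear seasonal fit
plot(fit2)
```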
The vertical bars on the right indicate a common scale across the panels. If a panel’s variation is smaller than its bar, that component may be statistically insignificant. For example, the trend and seasonal components vary more widely than their bars, but the remainder varies less than its bar, so the remainder could well be 0. A near-zero remainder means that stl can explain the patterns with the trend and seasonality it has extracted.